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INTRODUCTION 


Breast  cancer  is  a  leading  cause  of  death  in  women,  causing  an  estimated  -40,000  deaths  per 
year.  Mammography  is  the  most  effective  method  for  the  early  detection  of  breast  cancer,  and  it  has 
been  shown  that  periodic  screening  of  asymptomatic  women  does  reduce  mortality[4].  Many  breast 
cancers  are  detected  and  referred  for  surgical  biopsy  on  the  basis  of  a  radiographically  detected  mass 
lesion  or  cluster  of  microcalcifications.  Useful  interpretation  in  mammography  depends  on  the 
quality  of  the  mammographic  images  and  the  ability  of  the  radiologists  who  interpret  those  images. 

In  addition  to  mammography,  follow  up  imaging  with  the  use  of  other  modalities,  such  as  MRI,  and 
ultrasound  are  used  for  assessing  malignancy  of  objects  discovered  following  routine  screenings. 

The  long-term  goal  of  this  research  is  to  improve  breast  cancer  diagnosis,  risk  assessment,  response 
assessment,  and  patient  care  via  the  use  of  large-scale,  multi-modality  computerized  image  analysis. 
The  central  hypothesis  of  this  research  is  that  large-scale  image  analysis  for  breast  cancer  research 
will  yield  improved  accuracy  and  reliability  when  optimized  over  multiple  features  and  large  multi- 
modality  databases.  In  recent  years,  data  mining  and  data  driven  discovery  have  become  important 
research  tools  in  many  disciplines.  Massive  amounts  of  data  may  contain  hidden  structure  and  rich 
information,  previously  unavailable  for  characterization  within  smaller  subgroups.  Systematic 
search  can  reveal  that  structure  and  information.  In  our  context,  the  opportunity  is  as  follows:  The 
digital  age  of  medical  imaging  provides  an  ever-growing  archive  of  data.  Deep  analysis  of  this 
multimodal  imaging  data  can  be  used  to  train  and  optimize  algorithms  that  are  incorporated  into 
usable  clinical  systems,  thus  improving  overall  breast  imaging  interpretation  and  patient  outcome. 
Data  mining  can  also  enable  relational  discoveries  between  image  data  and  cancer  diagnosis, 
response,  and  outcome,  thus  adding  to  the  potential  for  “patient-specific  diagnoses  leading  to  patient- 
specific  management.”  Aspects  of  optimization  in  this  process  of  CADx  development,  were 
previously  infeasible  due  to  massive  data  and  computation  requirements.  However,  now  with 
advances  in  Grid-based  computing  many  research  avenues  exist. 

This  reports  covers  the  important  initial  developments  in  research  accomplished  in  the  past 
year  leading  towards  these  long  term  objectives.  During  this  first  year  of  research  activity,  in 
addition  to  executing  a  proof  of  principle  Grid-based  computing  work-flow,  the  recipient  was  mainly 
focused  on  identifying  suitable  theoretical/analytical  tools  for  carrying  out  larger  scale  investigation 
as  described  above,  and  conducting  preliminary  evaluations  of  the  usefulness  of  these  new  methods. 
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BODY 


Training  Accomplishments 

At  the  time  of  this  report,  the  recipient  of  the  Predoctoral  Traineeship  Award,  Andrew  R. 
Jamieson,  has  completed  17  out  of  the  20  required  courses  towards  the  Ph.D.  degree  in  Medical 
Physics.  The  remaining  three  courses  will  be  finished  by  2010  Fall,  at  the  latest.  Some  recent 
courses  completed  within  the  past  year  include  Machine  Learning,  Anatomy  of  the  Body,  Health 
Physics,  and  a  teaching  assistantship  in  Computer  Vision  and  Image  Processing. 

Research  Accomplishments 


1.  Designed  and  Successfully  Executed  Proof  of  a  Principle  Grid-based  Breast  CADx 

Images  Analysis  Work-flow  using  Swift  Script. 

We  designed  and  executed  a  pilot  study  to  utilize  large  scale  parallel  grid  computing  to  harness 
the  nationwide  cluster  infr astructure  for  optimization  of  medical  image  analysis  parameters.  A 
previously  developed  CAD  scheme  for  mass  lesions  in  mammography  was  ported  onto  the  grid 
computing  environment  by  wrapping  the  algorithm  code  with  the  Swift  script  workflow  langauge. 
The  CAD  scheme  was  then  configured  into  a  parallelizable  workflow  by  the  grid-software.  The 
workflows  were  executed  using  two  test  clusters  (in  Santa  Monica,  CA  and  Chicago,  IL)  consisting 
of  over  220  dual-CPU  nodes  combined.  Using  the  grid-environment  workflow,  parameter  sweeps 
were  conducted  for  lesion  segmentation  settings  based  on  radial-gradient-index  (RGI)  methods. 
Specifically,  the  Gaussian  width  (GW)  used  in  initially  filtering  lesion  images  for  segmentation  was 
varied  by  increments  of  1  mm  from  1  to  60  mm.  For  each  GW  sweep  the  entire  850  biopsy-proven 
mass  lesion  database  (411  benign,  439  malignant)  was  analyzed.  In  each,  29  different  mathematical 
descriptor  features  were  calculated,  followed  by  feature  selection  and  merging  with  linear 
discriminate  analysis.  Diagnostic  performance  was  estimated  by  ROC  analysis  by  calculating  AUC 
(from  PROPROC)  values  based  on  both  individual  features  alone,  and  merged.  For  merged 
classifiers,  AUC  values  were  found  using  round-robin  case-by-case  removal  and  replacement. 
Among  the  resulting  ,  computation  jobs  requiring  over  30  CPU  hours  on  a  single  lab  computer  were 
completed  in  approximately  35  minutes  in  this  preliminary  study.  Merged  AUC  values  increased 
from  0.50  (std.err.=0.018)  at  GW  of  lmm  with,  to  0.81  (std.err.=0.015)  at  1 0mni  GW,  with  relative 
plateaus  across  the  rest  of  the  parameter  space  to  60mm. 

In  general,  the  parameter  space  sweep  in  GW  identified  trends  in  individual  feature  performance 
as  well  as  merged  results.  Large  scale,  computationally  intensive  image  analysis  can  be  carried  out 
in  a  timely  fashion,  feasible  for  expedited  experimental  discovery,  as  well  as  for  more  thorough 
future  statistical  analysis. 

See  Appendix  A  for  Poster  summarizing  the  work-flow  and  data  found. 


2.  Investigation  of  Dimension  Reduction(DR)  in  Place  of  Feature  Selection  Breast 
CADx 
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Recently  developed  unsupervised  non-linear  dimension  reduction  (DR)  and  data  representation 
techniques  were  applied  to  computer-extracted  breast  lesion  feature  spaces  across  three  separate 
imaging  modalities:  ultrasound  (US)  with  1126  cases,  dynamic  contrast  enhanced-magnetic 
resonance  imaging  (DCE-MRI)  with  356  cases,  and  full-field  digital  mammography  (FFDM)  with 
245  cases.  Two  methods  for  non-linear  DR  were  explored:  Laplacian  Eigenmaps  of  Belkin  and 
Niyogi,[l]  and  t-distributed  stochastic  neighbor  embedding  (t-SNE)  of  van  der  Maaten  and 
Hinton.  [2]  These  methods  attempt  to  map  originally  high-dimensional  feature  spaces  to  more  human 
interpretable  lower-dimensional  spaces  while  preserving  both  local  and  global  information. 

Due  in  part  to  the  ever-growing  demand  of  data  driven  science,  in  recent  years  much  interest  has 
emerged  in  developing  techniques  for  discovering  efficient  representations  of  large-scale  complex 
data.  [5]  Conceptually  the  goal  is  to  discover  the  intrinsic  structure  of  the  data  and  adequately 
express  this  information  in  a  lower  dimensional  representation.  Classically,  the  problem  of  DR  and 
data  representation  has  been  approached  by  applying  linear  transformations  such  as  the  well-known 
Principal  Component  Analysis  (PCA)  or  more  general  Singular  Value  Decomposition  (SVD).  [6,7] 
Interestingly,  despite  PCA’s  age,  only  recently  has  this  method  been  considered  for  the  specific 
application  to  CADx  feature  space  reduction. [8]In  this  particular  breast  ultrasound  study,  while  no 
significant  boosts  in  lesion  classification  performance  were  discovered,  PCA  was  found  to  be  a 
suitable  substitute  in  place  of  more  computationally  intensive  and  cumbersome  feature  selection 
methods. [8]  This  efficient  lower  dimensional  PCA  data  representation,  i.e.  linear  combinations  of 
the  original  features  accounting  for  the  maximum  global  variance  decomposition  in  the  data,  proved 
capable  of  capturing  sufficient  information  for  robust  classification.  However,  PCA  is  not  capable 
of  representing  higher  order,  non-linear,  local  structure  in  the  data. 

The  goal  of  recently  proposed  non-linear  data  reduction  and  representation  methods  focuses 
on  this  very  problem.  [1,2]  The  present  methods  of  interest  to  this  study,  Laplacian  Eigenmaps  and  t- 
Distributed  Stochastic  Neighbor  Embedding  (t-SNE),  offer  two  distinct  approaches  for  explicitly 
addressing  the  challenge  of  capturing  and  efficiently  representing  the  properties  of  the  low 
dimensional  manifold  on  which  the  original  high-dimensional  data  may  lie.  Previous  studies  have 
investigated  other  non-linear  DR  techniques,  including  self-organizing  maps  (SOMs)  and  graph 
embedding,  for  breast  cancer  in  the  context  of  biomedical  image  signal  processing^, 10],  as  well  as 
for  a  breast  cancer  BIRADs  database  clustering[l  1],  To  our  knowledge  the  relationship  between 
breast  CADx  performance  and  these  non-linear  feature  space  DR  and  representation  have  yet  to  be 
properly  investigated.  These  new  techniques  may  contribute  two  key  enhancements  to  current 
CADx  schemes. 

1)  A  principled  alternative  to  feature  selection.  Both  methods  explicitly  attempt  to  preserve  as 
much  structure  in  the  original  feature  space  as  possible,  and  thus  require  no  need  to 
assumingly  force  exclusion  of  features  from  the  original  set,  and  hence  unnecessary  loss  of  image 
information. 

2)  A  more  natural  and  sparse  data  representation  that  immediately  lends  itself  to  generating 
human-interpretable  visualizations  of  the  inherent  structures  present  in  the  high-dimensional  feature 
data. 


3.  Evaluation  of  the  performance  Dimension  Reduction(DR)  in  Place  of  Feature 
Selection  Breast  CADx 

For  the  high-dimensional  feature  spaces  ,DR  methods  were  tested  across  all  modalities  for  a 
range  of  lower  target  dimensions  and  user-defined  algorithm  parameters.  We  evaluated  the  classifier 
performance  using  the  area  under  the  Receiver  Operating  Curve  ROC  curve  (AUC)  via  the  non- 
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parametric  Wilcoxon-Mann- Whitney  statistic,  as  calculated  using  the  PROPROC  software. 
Statistical  uncertainty  in  classification  performance  due  to  finite  sample  sizes  was  estimated  by 
implementing  0.632+  bootstrapping  methods  for  training  and  testing  the  classifiers.  Additionally, 
we  computed  the  95%  empirical  bootstrap  confidence  intervals  on  AUC  values  as  estimated  by  no 
less  than  500  bootstrap  case  set  re-samplings.  In  all  values  reported,  the  sampling  was  conducting 
on  a  by  lesion  basis,  as  there  may  be  multiple  images  associated  with  each  unique  lesion.  In  this 
regard,  during  classifier  testing,  the  set  of  classifier  outputs  associated  with  a  unique  lesion  were 
averaged  to  produce  a  single  value.  For  the  supervised  feature  selection  methods  (Automatic- 
relevance  determination  (ARD)  and  linear  step-wise),  feature  selection  was  conducted,  up  to  the 
specified  number  of  features,  on  each  bootstrapped  sample  set.  Notably,  the  more  general  Markov 
Chain  Monte  Carlo  based  Bayesian  artificial  neural  network  (MCMC-BANN)  was  coupled  with 
both  the  non-linear  ARD  and  linear-based  feature  selection  methods,  while  the  linear  discriminant 
analysis  (LDA)  was  only  with  the  LSW  selection.  As  some  of  the  calculations  are  computationally 
intensive,  particularly  the  t-SNE  mappings  and  MCMC-BANN  training  for  the  larger  US  data  set,  a 
25 6-CPU  shared. 

Results  are  summarized  in  depth  in  Appendix  B. 

4.  Investigation  Breast  CADx  Feature  Data  Representation  and  Visualization  with  Non- 
Linear  Local  Geometry  Preserving  Dimension  Reduction  Methods 

Please  see  Appendix  B  and  associated  figures  for  sample  breast  image  CADx  feature  data 
visualizations. 


5.  Investigation  of  Manifold  Regularization  for  Breast  CADx  using  Unlabeled 
Image  Data 

Supervised  classification,  which  embodies  the  traditional  role  of  CADx,  critically  requires 
that  the  “truth”  or  true  biological  disease  status,  for  instance  “malignant”  or  “benign”,  must  be 
known  for  each  case  image  during  algorithmic  training.  However,  accomplishing  this  data  assembly 
step  is  often  the  most  resource  expensive  component  of  conducting  CADx  research,  and  usually  acts 
as  a  severely  limiting  factor.  While  efforts  will  continue  to  streamline  the  gathering  of  pathological 
and  radiological  information  associated  with  each  clinical  case,  in  most  research  contexts,  a  relative 
abundance  of  readily  available  unlabeled  data  may  persist.  From  a  practical  standpoint  it  is  wasteful 
to  completely  discard  this  information.  Although  the  “truth”  may  be  unknown,  these  unlabeled 
cases  still  contain  potentially  useful  image  information.  In  particular,  the  unlabeled  image  data  can 
be  regarded  as  an  additional  sample  drawn  from  the  underlying  marginal  probability  distribution 
characteristic  of  the  combined  class-categories,  i.e.  both  “malignant”  and  “benign”.  A  large  enough 
unlabeled  data  sample  may  provide  sufficient  knowledge  of  the  inherent  structure  of  the  underlying 
marginal  image  distribution  to  guide  the  improved  design  of  supervised  classification  on  labeled 
cases.  More  simply  stated,  the  unlabeled  data  may  help  “regularize”  the  training  of  CADx 
algorithms,  and  consequently  improve  clinical  performance. 

Recently  developed  unsupervised  non-linear  dimension  reduction  and  data  representation 
techniques,  specifically  Laplacian  Eigenmaps  (Belkin  and  Niyogi)  and  t-SNE  (van  der  Maaten  and 
Hinton),  offer  a  principled  approach  for  integrating  unlabeled  and  labeled  image  feature  data.  Thus, 
the  purpose  of  this  study  is  to  investigate  the  potential  these  methods  have  for  leveraging  unlabeled 
data  information  towards  the  design  of  more  robust/stable  CAD  classification  algorithms.  This 
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preliminary  study  focuses  within  the  context  of  diagnosing  breast  mass  lesions  for  ultrasound. 

The  canonical  methodology  for  training/testing  CADx  algorithms  usually  consists  of  the 
following  distinct  steps:  1)  segmentation  and  feature  extraction,  2)  feature  selection,  3)  merged 
feature-  supervised  classifier  training,  and  4)  performance  evaluation.  The  incorporation  of 
unlabeled  data  into  the  CADx  algorithm  training  for  regularization  can  be  accomplished  by 
employing  unsupervised  dimension  reduction  in  place  of  explicit  feature  selection  in  step  two.  For 
example,  in  the  ultrasound  dataset  used  in  this  study,  up  to  8 1  features  are  extracted  from  the  lesion 
images.  Features  may  be  extracted  for  both  labeled  and  unlabeled  cases  (truth  known  and  unknown). 
Instead  of  using  supervised  feature  selection  (such  as  automatic  relevance  determination),  which  is 
dependent  exclusively  upon  the  labeled  cases,  unsupervised  dimension  reduction  can  be  used  to  map 
the  high  dimensional  feature  vectors,  including  the  unlabeled  feature  data,  into  a  lower  dimensional 
representation.  The  reduced  dimension  mapped  output  for  the  labeled  cases  may  then  be  used  as 
input  into  supervised  classification  training.  Crucially,  the  unlabeled  cases  have  exerted  influence 
on  the  relative  mapping  of  the  labeled  cases  used  to  train  the  classifier.  Ideally,  this  influence  serves 
as  a  regularizing  force,  leading  to  more  robust  performance  on  novel  cases.  Additionally,  the 
reduced  dimension  representations  may  be  amenable  to  useful  visualizations  in  2D  and  3D. 

Appendix  C  contains  graphic  illustrations  of  the  concept  of  dimensionality  reduction  as  well 
as  the  incorporation  of  unlabeled  into  the  CADx  training  algorithm. 
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KEY  RESEARCH  ACCOMPLISHMENTS 


•  Designed  and  executed  Swift  script  enabled  Grid  computing  work-flows,  and  managed  to 
display  clear  proof  of  principle  for  large  scale,  parallel  breast  image  analysis. 

•  Made  effective  use  of  the  new  256  CPU  SIRAF  Shared  Computing  Facility  at  the  University 
of  Chicago  Dept,  of  Radiology  both  as  a  test-bed  for  large  scale  parallel  job  submission  and 
workflow  management  and  to  rapidly  conduct  new  Multi-Modality  Breast  Image  CADx 
research,  culminating  in  a  peer-reviewed  journal  article  submission.  Estimated  Total  CPU  time 
used:  -100,000  to  300,000  hours. 

•  Investigated  the  use  of  cutting  edge  data-analysis/mining  techniques  as  applied  to 
Ultrasound,  FFDM,  and  DCE-MRI  Breast  Image  Feature  Space  Analysis  for  CADx  , 
specifically,  dimension  reduction  and  data  representation  techniques  (t-SNE  and  Laplacian 
Eigenmaps)  for  high  dimensional  data  spaces.  These  methods  allow  for  an  alternative  to 
traditional  feature  selection  methods.  Using  the  high-throughput  cluster  computing  capabilities, 
performance  metrics  and  intensive  statistical  cross-validation  (0.632+  bootstrap  and  ROC 
analysis  for  AUC  performance)  were  performed  to  gain  understanding  of  the  new  techniques 
potential  versus  previous  Breast  CADx  methodologies.  Results  indicate  the  ability  to  rival  or 
exceed  previous  CADx  performance. 

•  The  dimensional  reduction  and  data  representation  techniques  also  were  shown  to  provide 
rich  visualization  output  for  human  interpretation  of  the  complex  breast  image  feature  space 
geometry. 

•  Additionally,  the  promising  findings  and  have  motivated  a  number  of  new  research  avenues. 
Most  significantly,  the  incorporation  and  principled  use  of  "unlabeled"  (truth-unknown/non- 
biopsy  proven)  image  data  for  the  training  of  CADx  algorithms.  Specifically,  the  unsupervised 
dimension  reduction  techniques  can  use  the  feature  space  geometric  structure  to  help  regularize 
algorithmic  training. 
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REPORTABLE  OUTCOMES 


Conference  Presentations  and  Abstracts 

•  A.R.  Jamieson,  M  L  Giger,  M  Wilde,  L  Pesce,  I  Foster,  “Grid-Computing  for  Optimization 
of  CAD,”  Poster,  50th  Assembly  and  Annual  Meeting  of  American  Association  of  Physicist  in 
Medicine  (AAPM),  Houston,  Illinois,  USA,  July  2008 

•  A.R.  Jamieson,  ML  Giger,  L.  Pesce  “Regularized  Training  of  CADx  Algorithms  with 
Unlabeled  Data  Using  Dimension  Reduction  Techniques,”  Accepted  talk.  95nd  Assembly  and 
Annual  Meeting  of  Radiological  Society  of  North  America,  Chicago,  Illinois,  USA,  December 
2009. 

•  A.R.  Jamieson,  M  L  Giger,  et.  al.  “  Exploring  Non-Linear  Feature  Space  Dimension 
Reduction  and  Data  Representation  in  Breast  CADx”,  Accepted  talk.  5 1st  Assembly  and  Annual 
Meeting  of  American  Association  of  Physicist  in  Medicine  (AAPM)  Anaheim  CA,  USA,  July 
2009 

Peer-reviewed  Journal  Papers 

A.R.  Jamieson,  M.  L.  Giger,  et.  al.  “Exploring  Non-Linear  Feature  Space  Dimension 
Reduction  and  Data  Representation  in  Breast  CADx  with  Laplacian  Eigenmaps  and  t-SNE”, 
Med.  Phys,  {Accepted,  in  revision),  2009. 
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CONCLUSIONS 


Overall,  the  past  year  proved  to  be  highly  productive  for  the  recipient  of  the  Predoctoral 
Traineeship  Award,  in  terms  of  both  research  output  achieved,  leading  to  the  continued  personal 
development  as  an  independent  investigator,  as  well  as  the  intensive  didactic  training  and  education. 
The  recipient  has  completed  all  but  three  classes  (out  of  the  20  required)  and  will  finish  the  required 
course  load  by  Fall  2010,  at  the  latest.  The  courses  completed,  both  core  Medical  Physics  and 
research  oriented  theoretical  based  electives,  have  helped  to  guide  important  research  decisions. 

The  initial  research  effort  completed  during  this  first  year  has  not  only  produced  encouraging 
results  but  also  laid  an  excellent  foundation  for  beginning  large  scale  deployment  of  the 
experimental  techniques  onto  the  Grid  environment.  In  addition  to  successfully  deploying  a  proof 
of  principle  Grid  run,  we  have  used  the  local  256-CPU  cluster  effectively  to  gain  research  direction. 
Specifically,  we  have  managed  to  identity  and  learn  to  use  newly  developed  data  analysis  methods 
which  can  powerfully  enhance  the  proposed  objective  of  uncovering  important  information  from 
large  databases  of  breast  cancer  image  data.  These  new  methods  will  allow  us  to  interpret  the  nature 
of  the  underlying  structure  associated  with  image  data.  By  investigating  the  structure  of  the  multi- 
modality  breast  image  data,  we  can  then  correlate  the  findings  with  other  biological  and  genomic 
data  towards  maximizing  the  overall  impact  of  such  systems  for  future  clinical  deployment. 

Furthermore,  as  mentioned  in  the  report  above,  through  investigation  of  the  new  dimensional 
reduction  and  data  representation  techniques,  new  capabilities  have  been  identified  for  the  use  of 
"unlabeled"  image  feature  data.  This  is  particularly  of  interest,  as  the  vast  majority  of  clinically 
accumulated  data  is  never  analyzed  for  proof  of  biologic  origin  and  pathology  ("labeled").  Thus, 
introduction  of  these  new  methods  to  our  research  is  exciting,  as  not  only  will  more  data 
("unlabeled")  be  available  to  incorporate  into  our  algorithmic  development,  but  also,  ample 
opportunity  to  make  full  use  of  the  large  scale  computing  power  of  the  Grid. 
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Exploring  Non-Linear  Feature  Space  Dimension  Reduction  and  Data 
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Yuan,  and  Neha  Bhooshan 
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Abstract: 

In  this  preliminary  study,  recently  developed  unsupervised  non-linear  dimension  reduction 
(DR)  and  data  representation  techniques  were  applied  to  computer-extracted  breast  lesion  feature 
spaces  across  three  separate  imaging  modalities:  ultrasound  (US)  with  1126  cases,  dynamic  contrast 
enhanced-magnetic  resonance  imaging  (DCE-MRI)  with  356  cases,  and  full-field  digital 
mammography  (FFDM)  with  245  cases.  Two  methods  for  non-linear  DR  were  explored:  Laplacian 
Eigenmaps  of  Belkin  and  Niyogi,1  and  t-distributed  stochastic  neighbor  embedding  (t-SNE)  of  van 
der  Maaten  and  Hinton.2  These  methods  attempt  to  map  originally  high-dimensional  feature  spaces 
to  more  human  interpretable  lower-dimensional  spaces  while  preserving  both  local  and  global 
information.  The  properties  of  these  methods  as  applied  to  breast  computer-aided  diagnosis  (CADx) 
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were  evaluated  in  the  context  of  malignancy  classification  performance  as  well  as  in  the  visual 
inspection  of  the  sparseness  within  the  two-  and  three-dimensional  mappings.  Classification 
performance  was  estimated  by  using  the  reduced  dimension  mapped  feature  output  as  input  into  both 
linear  and  non-linear  classifiers:  Markov  Chain  Monte  Carlo  based  Bayesian  artificial  neural 
network  (MCMC-BANN)  and  linear  discriminate  analysis  (LDA).  The  new  techniques  were 
compared  to  previously  developed  breast  CADx  methodologies,  including  Automatic  Relevance 
Determination  (ARD)  and  linear  step-wise  (LSW)  feature  selection,  as  well  as  a  linear  DR  method 
based  on  Principal  Component  Analysis  (PCA).  Using  ROC  analysis  and  0.632+  bootstrap 
validation,  95%  empirical  confidence  intervals  were  computed  for  the  each  classifier’s  AUC 
performance.  Results:  In  the  large  US  dataset,  sample  high  performance  results  include,  AUC0.632+ 

=  0.88  with  95%  empirical  bootstrap  interval  [0.787 ;0.895]  for  13  ARD  selected  features  and 
AUC0  632+  =  0.87  with  interval  [0.817;0.906]  for  4  LSW  selected  features  compared  to  4D  t-SNE 
mapping  (from  the  original  81D  feature  space)  giving  AUC0632+  =  0.90  with  interval  [0.847;0.919], 
all  using  the  MCMC-BANN.  Conclusions:  Preliminary  results  appear  to  indicate  capability  for  the 
new  methods  to  match  or  exceed  classification  performance  of  current  advanced  breast  lesion  CADx 
algorithms.  While  not  appropriate  as  a  complete  replacement  of  feature  selection  in  CADx  problems, 
DR  techniques  offer  a  complementary  approach  which  can  aid  elucidation  of  additional  properties 
associated  with  the  data.  Specifically,  the  new  techniques  were  shown  to  possess  the  added  benefit  of 
delivering  sparse  lower-dimensional  representations  for  visual  interpretation,  revealing  intricate  data 
structure  of  the  feature  space. 

Keywords:  non-linear  dimension  reduction,  computer-aided  diagnosis,  breast  cancer,  Laplacian 
Eigenmaps,  t-SNE 
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I.  Introduction 


Radiologic  image  interpretation  is  a  complex  task.  A  radiologist's  expertise,  developed  only 
with  exhaustive  training  and  experience,  rests  in  their  ability  for  extracting  and  meaningfully 
synthesizing  relevant  information  from  a  medical  image.  However,  even  under  idealized  image 
acquisition  conditions,  precise  conclusions  may  not  be  possible  for  certain  radiologic  tasks.  Thus, 
computer  aided  diagnosis  (CADx)  systems  have  been  introduced  in  a  number  of  contexts  in  an 
attempt  to  assist  human  interpretation  of  medical  images.3  A  relatively  well-developed  clinical 
application  for  which  computerized  efforts  in  radiological  image  analysis  have  been  studied  is  the 
use  of  CAD  in  the  task  of  detecting  and  diagnosing  breast  cancer.4"10  Similar  to  the  radiologist’s 
task,  a  computer  algorithm  is  designed  to  make  use  of  the  highly  complicated  breast  image  input 
data,  attempting  to  intelligently  reduce  image  information  into  more  interpretable  and  ultimately 
clinically-actionable  output  structures,  such  as  an  estimate  of  the  probability  of  malignancy. 
Understanding  how  to  optimally  make  use  of  the  enormity  of  the  initial  image  information  input  and 
best  arrive  at  the  succinct  conceptual  notion  of  “diagnosis”  is  a  formidable  challenge.  Although 
there  may  be  any  number  of  various  operations/transformations  involved  in  arriving  at  this  high- 
level  end  output,  whether  in  the  human  brain  or  in  silico,  two  common  critical  pursuits  are  proper 
data  representation  and  reduction.  The  current  study  aims  to  explore  the  potential  enhancements 
offered  to  breast  mass  lesion  CADx  algorithms  through  the  application  of  two  recently-developed 
dimensionality  reduction  and  data  representation  techniques,  Laplacian  Eigenmaps  and  t-distributed 
stochastic  neighbor  embedding  (t-SNE).12 

II.  Background 

II. A.  Current  CADx  Feature  Representation 
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Restricted  by  limited  sample  datasets,  computational  power,  and  lack  of  complete  theoretical 
formalism,  image-based  pattern  recognition  and  classification  techniques  often  tackle  the  objective 
task  at  hand  by  substantially  simplifying  the  problem.  Traditionally,  breast  CADx  systems  employ 
a  two  pronged  approach,  first,  image  pre-processing  and  feature  extraction,  and  second, 
classification  in  the  feature  space,  either  by  unsupervised  methods,  supervised  methods,  or  both.  A 
review  of  past  and  present  CADx  methods  employed  can  be  found  in  referenced  articles 
referenced.3'1 1  Often,  instead  of  attempting  to  make  use  of  the  complete  image12,  CADx  typically 
condenses  image  information  down  to  a  vector  of  numerical  values,  each  representative  of  some 
attribute  of  the  image  or  lesion  present  in  the  image.  One  can  consider  this  first  data  reduction  step 
as  “perceptual”  processing,  meaning  that  at  this  stage  the  algorithm’s  goal  is  to  isolate  and 
“perceive”  only  the  most  relevant  components  of  the  original  image  that  will  contribute  towards 
distinguishing  between  the  target  classes  (e.g.,  malignant  or  benign).  One  of  the  steps  in  eliminating 
unnecessary  image  information  is  lesion  margin  segmentation. 5,13  Typically,  features,  such  as  those 
extracted  from  the  segmented  lesion,  are  heuristic  in  nature  and  mimic  important  human  identified 
aspects  of  the  lesion.  However  more  mathematical  and  abstract  feature  quantities  may  also  be 
calculated  that  may  represent  information  visually  imperceptible  to  the  unaided  eye.  While  the  use 
of  data  from  a  segmented  lesion  introduces  bias  into  the  algorithm’s  task  as  a  whole,  this  “informed” 
bias  allows  for  the  efficient  removal  of  much  unnecessary  image  data,  for  instance  normal 
background  breast  tissue.  From  here  the  second  main  component  of  the  CADx  algorithm  falls 
usually  into  the  context  of  the  well-formalized  canonical  problem  found  in  statistical  pattern 
recognition  for  classification14'15. 

After  the  first  CADx  phase  of  feature  extraction,  each  high-dimensional  image  in  the  sample 
set  is  now  reduced  to  a  single  vector  in  a  lower-dimensional  feature  space.  However,  due  to  the 
finite  size  of  image  sample  data,  if  too  many  features  are  examined  simultaneously,  regions 
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containing  a  low  density  of  points  in  the  feature  space  will  exist,  resulting  in  statistically 
inconclusive  classification  ability.  This  dilemma  is  affectionately  termed  the  “curse  of 
dimensionality.”16  Thus,  a  further  reduction  in  the  full  feature  space  is  required  for  a  practically 
useful  data  representation.  This  aspect  is  a  major  concern  of  the  second  component  of  traditional 
CADx  schemes,  and  is  succinctly  known  as  “feature  selection”.  Much  literature  has  been  generated 
on  this  subject  matter  in  the  explicit  context  of  improving  CADx  performance  1719  Some  CADx 
schemes  may  employ  only  4-5  features  maximum,  in  which  case,  feature  selection  may  not  be 
necessary,  since  the  dataset  sample  size,  even  for  relatively  smaller  sizes,  may  be  sufficiently  large  to 
avoid  over-training  classifiers.  However,  it  is  reasonable  to  imagine  CADx  researchers  interested  in 
testing  hundreds  of  potential  features.  In  either  case,  when  appropriately  coupled  with  a  well- 
regularized  supervised  classification  method,  the  ultimate  objective  of  features  selection  is  to 
discover  the  “optimal”  data  representation,  or  sub-set  of  features  for  robustly  maximizing  the  desired 
diagnostic  task  performance.  That  is,  the  method  attempts  both  to  mimic  and  to  maximize  the 
theoretical  upper  bound  or  ideal  observer  performance  possible  over  the  sampled  joint  probability 
distribution  of  the  selected  features.  While  this  step  is  critical,  finding  such  a  sub-set  is  non-trivial 
and  may  also  be  highly  dependent  on  the  specific  characteristics  of  the  sample  data.  Developed 
techniques  in  feature  selection  for  CADx  range  from  simpler  linear  methods,  such  as  those  based  on 
linear  discriminate  analysis  (LDA),  to  non-linear  and  more  sophisticated  Bayesian-based,  such  as 
the  use  of  Bayesian  Artificial  Neural  Networks  (BANN)  and  Automatic  Relevance  Determination 
(ARD),  to  random-search  stochastic  methods  such  as  genetic  algorithms  as  well  as  information 
theoretic  techniques17’19"21. 

The  most  striking  quality  of  the  methods  mentioned  above,  in  the  context  of  CADx,  is  that 
during  feature  selection,  some  features  are  completely  removed  from  the  final  classification  scheme, 
and  hence  image  information  is  either  explicitly  or  implicitly  discarded  altogether.  However,  while 
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removing  out  all  the  information  associated  with  a  specific  feature  not  selected,  by  selecting  a 
smaller  sub-set  of  individual  features,  what  is  gained  is  greater  immediate  human  interpretability. 
Specifically,  the  isolated  groups  of  features  may  have  clear  physical  or  radiological  meanings  and 
thus  may  be  of  interest  to  investigators  or  radiologists  for  understanding  how  these  characteristics 
relate  to  the  ability  to  distinguish  class  categories  (malignant,  benign,  cyst,  etc..).  To  this  end,  in 
order  to  interpret  the  nature  of  the  feature  space  and  attempt  to  identify  characteristic  trends,  one 
may  visually  inspect  plots  displaying  single  features  or  attempt  to  capture  synergistic  qualities 
between  two  or  three  features  simultaneously.  Above  three  dimensions,  as  it  becomes  non-trivial  to 
interpret  the  structure  of  the  feature  space,  often  instead,  the  use  of  a  metrics  such  at  the  ROC  curve 
and/or  AUC  based  on  output  from  the  decision  variable  of  a  trained  merged  feature  classifier  are 
used  to  interrogate  the  quality  of  the  higher  dimensional  feature  spaces. 

As  such,  beyond  identifying  which  feature  or  features  appear  to  hold  classification  utility, 
current  CADx  methods  offer  little  theoretical/fomial  guidance  in  a  recovering  understanding  of  the 
inherent  data  structure  represented  by  the  higher  dimensional  feature  spaces. 

II.B.  Proposed  Feature  Space  Representation  and  Reduction  for  CADx 

Due  in  part  to  the  ever-growing  demand  of  data  driven  science,  in  recent  years  much  interest 
has  emerged  in  developing  techniques  for  discovering  efficient  representations  of  large-scale 
complex  data.22  Conceptually  the  goal  is  to  discover  the  intrinsic  structure  of  the  data  and 
adequately  express  this  information  in  a  lower  dimensional  representation.  Classically,  the  problem 
of  dimension  reduction(DR)  and  data  representation  has  been  approached  by  applying  linear 
transformations  such  as  the  well-known  Principal  Component  Analysis  (PCA)  or  more  general 
Singular  Value  Decomposition  (SVD). 23  24  Interestingly,  despite  PCA’s  age,  only  recently  has  this 
method  been  considered  for  the  specific  application  to  CADx  feature  space  reduction.25  In  this 
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particular  breast  ultrasound  study,  while  no  significant  boosts  in  lesion  classification  performance 
were  discovered,  PCA  was  found  to  be  a  suitable  substitute  in  place  of  more  computationally 
intensive  and  cumbersome  feature  selection  methods.25  This  efficient  lower  dimensional  PCA  data 
representation,  i.e.  linear  combinations  of  the  original  features  accounting  for  the  maximum  global 
variance  decomposition  in  the  data,  proved  capable  of  capturing  sufficient  information  for  robust 
classification.  However,  PCA  is  not  capable  of  representing  higher  order,  non-linear,  local  structure 
in  the  data. 

The  goal  of  recently  proposed  non-linear  data  reduction  and  representation  methods  focuses 
on  this  very  problem.1,2  The  present  methods  of  interest  to  this  study,  Laplacian  Eigenmaps  and  t- 
Distributed  Stochastic  Neighbor  Embedding  (t-SNE),  offer  two  distinct  approaches  for  explicitly 
addressing  the  challenge  of  capturing  and  efficiently  representing  the  properties  of  the  low 
dimensional  manifold  on  which  the  original  high-dimensional  data  may  lie.  Previous  studies  have 
investigated  other  non-linear  DR  techniques,  including  self-organizing  maps  (SOMs)  and  graph 
embedding,  for  breast  cancer  in  the  context  of  biomedical  image  signal  processing26,27,  as  well  as  for 
a  breast  cancer  BIRADs  database  clustering.28  To  our  knowledge  the  relationship  between  breast 
CADx  performance  and  these  non-linear  feature  space  DR  and  representation  have  yet  to  be 
properly  investigated.  These  new  techniques  may  contribute  two  key  enhancements  to  current 
CADx  schemes. 

1 .  A  principled  alternative  to  feature  selection.  Both  methods  explicitly  attempt  to  preserve 
as  much  structure  in  the  original  feature  space  as  possible,  and  thus  require  no  need  to 
assumingly  force  exclusion  of  features  from  the  original  set,  and  hence  unnecessary  loss 
of  image  information. 

2.  A  more  natural  and  sparse  data  representation  that  immediately  lends  itself  to  generating 
human-interpretable  visualizations  of  the  inherent  structures  present  in  the  high- 
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dimensional  feature  data. 


It  is  important  to  note  that  by  employing  DR  on  CADx  feature  spaces,  one  surrenders,  to  a  varying 
extent,  the  ability  to  immediately  interpret  the  physical  meaning  of  the  embedded  representation. 

Yet,  critically,  this  is  a  necessary  and  fundamental  trade-off,  as  the  conceptual  focus  is  shifted  to  a 
more  holistic  approach,  specifically,  that  of  discovering  an  efficient  lower  dimensional 
representation  of  the  intrinsic  data  structure.  The  core  tenant  of  such  an  unsupervised  approach  is  to 
limit  assumptions  imposed  on  the  data.  This  major  shift  in  philosophy  regarding  the  original  high 
dimensional  feature  space  embodies  the  notion,  “let  the  data  speak  for  itself.”  It  seems  reasonable  to 
assume  that  if  supervised  classifiers  are  capable  of  uncovering  sufficient  data  structure  in  the 
extracted  feature  space  for  producing  adequate  classification  performance,  then  such  principled  local 
geometry  preserving  reduction  mappings  should  reveal  structural  evidence  corroborating  such 
findings. 

II.C.  Outline  of  Evaluation  for  Proposed  Methods 

The  primary  objective  of  this  study  is  to  evaluate  the  classification  performance 
characteristics  of  breast  lesion  CADx  schemes  employing  the  Laplacian  Eigenmap  or  t-SNE  DR 
techniques  in  place  of  previously  developed  feature-selection  methods.  Secondly,  and  more 
qualitatively,  we  aim  to  investigate  and  gain  insight  into  the  properties  of  sample  visualizations 
representative  of  lower-dimensional  feature  space  mappings  of  high-dimensional  breast  lesion 
feature  data.  Additionally,  the  feasibility  and  robustness  of  these  non-linear  reduction  methods  for 
CADx  feature  space  reduction  are  tested  across  three  separate  imaging  modalities:  ultrasound  (US), 
dynamic  contrast  enhanced  MRI  (DCE-MRI),  and  full-field  digital  mammography  (FFDM),  having 
case  sets  of  1126  case,  356  cases,  and  245  cases,  respectively. 
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III.  Methods 


III. A.  Dataset 

All  data  characterized  in  this  study  consists  of  clinical  breast  lesions  presented  in  images 
acquired  at  the  University  of  Chicago  Medical  Center.  Lesions  are  labeled  according  to  the  truth 
known  by  biopsy  or  radiologic  report  and  collected  under  HIPAA-compliant  IRB  protocols. 
Furthermore,  the  breast  lesion  feature  datasets  were  generated  from  previously  developed  CADx 
algorithms  at  the  University  of  Chicago.  For  a  review  of  these  techniques  see  Giger,  Huo,  Kupinski 
for  X-ray  mammography,  Drukker,  for  US,  and  Chen  for  DCE-MRI.  4-11,29 

In  each  of  the  modalities,  the  lesion  center  is  identified  manually  for  the  CADx  algorithm, 
which  then  performs  automated-seeded  segmentation  of  the  lesion  margin  followed  by  computerized 
feature  extraction.  Table  1  below  summarizes  the  content  of  the  respective  imaging  modality 
databases  used,  including  the  total  number  of  initial  lesion  features  extracted.  Note  that  the 
mammographic  imaging  modality  (FFDM)  contains  only  two  lesion  class  categories,  malignant  and 
benign.  For  ultrasound  and  DCE-MRI  a  more  detailed  sub-categorization  is  provided,  including 
invasive  carcinoma  (IDC),  ductal  carcinoma  in  situ  (DCIS),  benign  solid  masses,  and  benign  cystic 
masses.  For  clarity,  this  initial  study  only  considers  binary  classification  performance  in  the  task  of 
distinguishing  between  the  more  broad  identity  of  malignant  and  benign  (cancerous  vs.  non- 
cancerous).  However,  during  qualitative  inspection  of  the  dimension  reduced  mappings,  it  will  be  of 
interest  to  re-introduce  these  distinctions  for  visualization  purposes. 


Total 

Number  of 

Total  Number  of 

Number  of 

Modality 

Number  of 

Malignant 

Lesion  Features 

Benign  Lesions 

Images 

Lesions 

Calculated 
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US 

2956 

158 

968  ( 401  mass/ 

567  cystic) 

81 

DCE-MRI 

356 

223  (151  IDC  / 

72  DCIS) 

133 

31 

FFDM 

735 

132 

113 

40 

Table  1.  Feature  Database  Characteristics. 


Geometric,  texture,  and  morphological  features,  such  as  margin  sharpness,  were  extracted  across  all 
modalities.  Also,  the  DCE-MRI  dataset  includes  kinetic  features,  and  the  US  features  include  those 
related  to  posterior  acoustic  behavior. s’ 10  All  raw  extracted  feature  value  datasets  were  normalized 
to  zero  mean  and  divided  by  the  unit  sample  standard  deviation.  Due  to  page  limitations,  the  details 
of  each  feature  can  be  found  in  the  referenced  papers.4"11,29 

III.B.  Classifiers 

In  our  evaluation  of  the  new  DR  techniques,  we  chose  two  types  of  classifiers:  a  relatively 
simple  linear  discriminant  analysis  (LDA)  classifier  and  a  more  sophisticated  non-linear,  Bayesian 
artificial  neural  network,  classifier  (BANN).  15  LDA  is  a  well-known  and  commonly  used  linear 
classification  method  which  will  not  be  reviewed  here,  for  reference  and  examples  in  breast  lesion 
CADx  see  references.  4,30,31  The  BANN,  as  the  name  suggests,  follows  the  usual  multi-layer- 
perception,  neural  network  design,  but  additionally  employs  Bayesian  theory  as  a  means  of  classifier 
regularization  15,32.  The  BANN  has  been  shown  to  model  the  optimal  ideal  observer  for 
classification  given  sufficient  sample  sizes  as  input  for  training.33  The  critical  technical  hurdle  in 
implementing  BANNs  lies  in  accurately  estimating  posterior  weight  distributions,  as  analytical 
calculation  is  intractable.  As  such,  either  approximation  or  sampling  based  methods  must  be 
deployed  in  practice.34  Markov  Chain  Monte  Carlo  (MCMC)  sampling  methods  can  be  used  to 
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directly  sample  from  the  frill  posterior  probability  distribution.32  We  implemented  a  MCMC-BANN 
classifier  using  Nabney’s  Netlab  package  for  MATLAB.35  The  following  network  architecture,  k— 
(k+l)--l ,  was  used.  That  is,  k  input  layer  nodes  (one  for  each  of  the  k  selected  features),  a  hidden 
layer  with  (k  +  1)  nodes,  and  a  single  output  target  as  probability  of  malignancy.  For  each  classifier 
trained,  we  generated  at  least  2000  MCMC  samples  of  the  weights’  posterior  probability 
distribution.  The  mean  value  of  the  classification  prediction  (probability  of  malignancy)  output 
from  each  of  the  different  2000  weight  samples  was  used  to  produce  a  single  classification  estimate 
for  new  test  input  cases. 

III.C.  Explicit  Supervised  Feature  Selection  Methods 

Two  previously  developed  feature  selection  methods  are  considered  in  this  paper  for 
comparison,  and  include  linear  step-wise  and  ARD  feature  selection.  These  methods  are  used  to 
identify  a  specific  set  of  features  for  input  into  the  classifier. 

Linear  Stepwise  Feature  Selection 

Linear  step-wise  feature  selection  (LSW-FS)  relies  on  linear  discriminant-based  functions. 
Beginning  with  only  a  single  selected  feature,  multiple  combinations  of  features  are  considered  one 
at  a  time,  by  exhaustively  adding,  retaining,  or  removing  each  subsequent  feature  to  the  potential  set 
of  selected  features.  For  each  new  combination,  a  metric,  the  Wilks’  lambda  is  calculated  and  a 
selection  criterion  based  on  F-statistics  is  used.17  The  “F-to-enter”  and  “F-to-remove”  used  in  this 
study  were  automatically  adjusted  to  allow  for  the  specified  number  of  features  desired  for  US, 
DCE-MRI,  and  FFDM  feature  selection.  For  examples  of  LSW-FS  use  in  breast  CADx  references 
are  provided.17'25’30 
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Automatic  Relevance  Determination 


A  consequence  of  the  BANNs  is  the  possibility  for  joint  feature  selection  and  classification 
using  Automatic  Relevance  Determination  (ARD).  ’  ’  ’  ARD  works  by  placing  Bayesian  hyper¬ 
priors,  also  known  as  hierarchical  priors,  over  the  initial  prior  distributions  already  imposed  on  the 
network  weights  connected  to  the  input  nodes.  The  “relevant”  features  are  then  discovered  as 
estimates  for  the  hyper  parameters,  which  characterize  the  prior  distributions  over  the  respective 
input  layer  weights,  are  updated  via  Gibbs  sampling  giving  the  posterior  hyper-parameter  estimate. 
The  magnitudes  of  the  final,  converged  upon  hyper-parameters  are  then  used  to  indicate  the  relative 
utility  of  the  respective  feature  input  layer  weights  towards  accomplishing  the  classification  task. 
Thus,  by  way  of  the  Bayesian  regularization,  ARD  allows  for  one-shot  feature  selection  and 
classifier  design.  Furthermore,  a  key  advantage  of  ARD  feature  selection  is  its  ability  to  identify 
important  non-linear  features  coupled  to  the  classification  objective,  due  to  the  inherent  non-linear 
nature  of  BANN.19  Due  to  these  qualities,  ARD-MCMC-BANN  classifiers  were  also  included  for 
comparison  in  our  study. 

In  this  study  we  extend  MCMC-BANN  to  incorporate  ARD  following  the  implementation  of 
Nabney. 35  This  methodology  was  previously  investigated  for  breast  feature  selection  and 
classification  in  DCE-MRI  CADx.19  In  our  study,  1000  samples  were  calculated  for  the  hyper¬ 
parameters  beginning  with  a  gamma  hyper-prior  distribution  of  mean  parameter  value  equal  to  3 
and  a  shape  parameter  equal  to  4. 

III.D.  Unsupervised  Dimension-Reduction  Feature  Mappings 

In  comparison  to  the  supervised  feature-selection  methods,  three  unsupervised  DR  methods 
were  evaluated  here;  the  latter  two  non-linear  methods  are  offered  as  a  novel  application  to  the  field 
of  breast  image  CADx.  The  general  problem  of  dimensionality  reduction  can  be  described 
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mathematically  as:  provided  an  initial  set  xh  xk  of  k  points  in  Rl ,  discover  a  set  yb  yk  in  R"‘ 

such  that  yt  sufficiently  describes  or  “represents”  the  qualities  of  interest  found  in  the  original  set  xt. 
In  the  context  of  breast  lesion  CADx  feature  extraction,  the  ideally  lower  dimensional  mappings 
should  aim  to  preserve  and  represent  as  much  relevant  structural  information  towards  the  task  of 
malignancy  estimation.  It  should  be  noted  that  DR  still  requires,  in  some  sense,  “feature  selection,” 
meaning,  one  must  specify  the  number  of  mapped  dimensions  to  retain  for  the  subsequent 
classification  step.  Ideally,  methods  designed  to  estimate  intrinsic  dimensionality  of  the  data 
structure  could  be  used  to  direct  this  choice.36  However,  proper  evaluation  of  the  integrity  of  such 
methods  in  this  context  is  beyond  the  scope  of  this  research  effort.  Thus,  in  approaching  the  problem 
from  a  more  naive  perspective,  as  done  here,  focus  is  centered  on  gaining  a  general  intuition  for  the 
overall  major  trends  encountered. 

Linear  Feature  Reduction:  PC  A 

Mathematically,  PCA  is  linear  transformation  which  maps  the  original  feature  space  onto 
new  orthogonal  coordinates.  The  new  coordinates,  or  principal  components  (PC),  represent  ordered 
orthogonal  data  projections  capturing  the  maximum  variance  possible,  with  the  first  PC 
corresponding  to  the  highest  global  variance.23’24  Drukker,  et  al.  used  PCA  as  an  alternative  to 
feature  selection  for  breast  US  CADx.25 

Non-linear  Feature  Dimension  Reduction 

As  discussed  in  the  introduction  and  background  sections,  the  following  two  recently 
proposed  DR  and  data  representation  methods  are  non-linear  in  nature  and  specifically  designed  to 
address  the  problem  of  local  data  structure  preservation.  Laplacian  Eigenmaps  and  t-SNE  offer 
highly  distinct  solutions  to  this  problem. 
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i.  Laplacian  Eigenmaps 

Drawing  on  familiar  concepts  found  in  spectral  graph  theory,  Laplacian  Eigenmaps, 
proposed  by  Belkin  and  Niyogi  in  2002,  use  the  notion  of  a  graph  Laplacian  applied  to  a  weighted 
neighborhood  adjacency  graph  containing  the  original  data  set  information.1  This  weighted 
neighborhood  graph  is  regarded  geometrically  as  a  manifold  characterizing  the  structure  of  the  data. 
The  eigenvalues  and  eigenvectors  are  computed  for  the  graph  Laplacian  which  are  in  turn  utilized 
for  embedding  a  lower  dimensional  mapping  representative  of  the  original  manifold.  Acting  as  an 
approximation  to  the  Laplace  Beltrami  operator,  the  weighted  graph  Laplacian  transformation  can 
be  shown,  in  a  certain  sense,  to  optimally  preserve  local  neighborhood  information.  37  Thus,  the 
feature  data  considered  in  the  reduced  dimensional  space  mapping  is  essentially  a  discrete 
approximate  representation  of  the  natural  geometry  of  the  original  continuous  manifold. 

As  Belkin  and  Niyogi  note,  the  algorithm  is  relatively  simple  and  straightforward  to 
implement.  Additionally,  the  algorithm  is  not  computationally  intensive.  For  our  largest  dataset  the 
mappings  were  computed  within  a  few  seconds  using  MATLAB  code.  Algorithm  details  as  well  as 
explanation  of  necessary  input  parameters  for  the  implementation  used  here  are  provided  below  in 
section  VIILA  of  the  Appendix. 

It  is  important  to  note  that  there  is  no  theoretical  justification  for  how  to  choose  the  needed 
parameters  for  the  algorithm.  Thus,  an  array  of  parameter  choices  was  evaluated  in  this  study. 
Lastly,  parts  of  the  MATLAB  code,  related  only  to  the  implementation  of  the  Laplacian  Eigenmap, 
were  modified  from  the  publically  available  dimension  reduction  toolbox  provided  by  Laurens  van 
der  Maaten  of  Maasticht  University.38 

ii.  t-Distributed  Stochastic  Neighbor  Embedding  (t-SNE) 
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The  other  non-linear  mapping  technique  considered,  t-Distributed  Stochastic  Neighbor 
Embedding  (t-SNE)  of  van  der  Maaten  and  Elinton2,  approaches  the  dimension  reduction  and  data 
reduction  problem  by  employing  entirely  different  mechanisms  to  the  Laplacian  Eigenmaps.  t-SNE 
attacks  DR  from  a  stochastic  and  probabilistic-based  framework.  While  requiring  orders  of 
magnitude  more  computational  effort,  such  statistically-oriented  approaches,  provided  they  are  well- 
conditioned,  may  potentially  offer  greater  flexibility  in  certain  contexts  due  in  part  by  the  lessening 
of  potentially  restrictive  theoretical  mathematical  formalism.  For  these  reasons  the  t-SNE  method 
was  considered  as  an  interesting  comparison  alongside  the  Laplacian  Eigenmap. 

t-SNE  is  an  improved  variation  on  the  original  stochastic  neighbor  embedding  (SNE)  of 
Hinton  and  Rowies.39  The  basic  idea  behind  SNE  is  to  minimize  the  difference  between  specially 
defined  conditional  probability  distributions  that  represent  similarities,  calculated  for  the  data  points 
in  both  the  high  and  low  dimensional  representations.  In  particular,  SNE  begins  by  first  computing 
the  conditional  probability  given  by 


P 


j\i 


exp(- \\xj  ~xj ||/2cr* 2 ) 

Zte-expf_llx'“^H/2c7'2) 


and 


exp 

[HI 

L  “W 

r) 

y 

exp 

(-  L  - 

yk 

f] 

(1) 


and  in  the  lower  dimensional  space  with  p{ ,  and  qiV  set  to  zero.  These  similarities  express  the 
probability  that  x,  (v,)  would  select  xf  (y/)  as  its  neighbor,  resulting  in  high  values  for  nearby  points 
and  lower  values  for  distantly  separated  ones.  The  central  assumption  in  SNE  is  that  if  the  low¬ 
dimensional  mapped  points  in  Y  space  correctly  model  the  similarity  structure  of  its  higher¬ 
dimensional  counterparts  in  X,  then  the  conditional  probabilities  will  be  equal.  The  summed 
Kullback-Leibler  (KL)  divergence  is  used  to  gauge  how  well  qjl;  models p^.  Using  gradient  descent 
methods,  SNE  minimizes  a  KL  based  cost  function.  Sampled  points  from  an  isotropic  Gaussian 
with  small  variance  centered  at  the  origin  are  used  to  initialize  the  gradient  decent.  Updates  are 
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made  to  the  mapped  space  Y  for  each  iteration.  Additionally,  the  parameter  cr,-  of  eq  (1)  must  be 
selected,  cr,  is  the  variance  in  the  Gaussian  centered  on  the  high  dimensional  point  xr  Because  of  the 
difficultly  in  determining  if  an  optimal  cr,  exists,  a  user  defined  property  called  perplexity  is  used  to 
facilitate  its  selection,  defined  by  Perp(P)=2H<Pi> .  Calculated  in  bits,  H(P)  is  the  Shannon  entropy 
over  P, 

H(Pi  )  =  -^Pj\i  log  2  Pj\i  (2) 

j 

During  SNE,  a  binary  search  is  performed  to  find  the  value  of  cr,  that  produces  a  P,  with  the  user 
specified  perplexity.  Suggested  typical  settings  range  between  5  and  50. 2 

t-SNE  introduces  two  critical  improvements  to  SNE.2  First,  the  gradient  as  well  as  cost 
function  optimization  is  simplified  by  using  symmetrized  conditional  probabilities  to  define  the  joint 
probabilities  on  P  and  Q  (e.g.  pu  =  ,  +pi[j)/2n)  and  the  minimizing  cost  over  a  single  KL  divergence 

as  opposed  to  a  sum, 
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Second,  the  distributional  form  of  the  low-dimensional  joint  probabilities  is  changed  from  a 
Gaussian,  to  the  heavier  tailed  Student  t-distribution  with  one  degree  of  freedom.  Roughly,  this 
promotes  a  greater  probability  for  moderately  distanced  data  points  in  high  dimensional  space  to  be 
expressed  by  a  larger  distance  in  the  low-dimensional  map,  thus  more  “faithfully”  representing  the 
original  distance  structure,  and  avoiding  the  “crowding  problem.”  2  The  new  cy;/  is  defined  as 
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After  incorporating  the  altered  qi},  the  final  gradient  for  the  cost  function  is  given  by 
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TT  =  4X  {p‘i  ~  qij  -  yj  )(1  +  ly<  ~  yj  f  )  1 

fyi  i 

A  step  by  step  algorithm  outline  for  t-SNE  is  provided  in  section  VIII. B  of  the  Appendix. 

As  recommended  by  Hinton  and  van  der  Maaten2,  PCA  is  first  applied  to  the  high¬ 
dimensional  input  data  in  order  to  expedite  the  computation  of  the  pairwise  distances.  Lastly,  as  t- 
SNE  was  developed  primarily  for  2D  and  3D  data  representation  and  visualization,  it  is  important  to 
note  that  the  authors  warn  performance  of  t-SNE  is  not  well  understood  for  the  general  purpose  of 
DR.2  By  applying  t-SNE  to  the  CADx  feature  reduction  problem  we  hope  to  offer  at  least  some 
empirical  insight  towards  understanding  its  properties  in  such  contexts.  We  used  van  der  Maaten 
publicly  available  t-SNE  MATLAB  code  and  Intel  processer  optimized  “fast_tsne”  to  generate  the 
present  data  mappings40. 

III.E.  Classifier  Performance  Estimation  and  Evaluation 

The  high-dimensional  feature  spaces  DR  methods  were  tested  across  all  modalities  for  a 
range  of  lower  target  dimensions  and  user-defined  algorithm  parameters.  We  evaluated  the  classifier 
performance  using  the  area  under  the  Receiver  Operating  Curve  ROC  curve  (AUC)  via  the  non- 
parametric  Wilcoxon-Mann- Whitney  statistic,  as  calculated  using  the  PROPROC  software.41"43 
Statistical  uncertainty  in  classification  performance  due  to  finite  sample  sizes  was  estimated  by 
implementing  0.632+  bootstrapping  methods  for  training  and  testing  the  classifiers.44'31  Additionally, 
we  computed  the  95%  empirical  bootstrap  confidence  intervals  on  AUC  values  as  estimated  by  no 
less  than  500  bootstrap  case  set  re-samplings.  In  all  values  reported,  the  sampling  was  conducting 
on  a  by  lesion  basis,  as  there  may  be  multiple  images  associated  with  each  unique  lesion.  In  this 
regard,  during  classifier  testing,  the  set  of  classifier  outputs  associated  with  a  unique  lesion  were 
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averaged  to  produce  a  single  value.  For  the  supervised  feature  selection  methods  (ARD  and  LSW), 
feature  selection  was  conducted,  up  to  the  specified  number  of  features,  on  each  bootstrapped  sample 
set.  Notably,  the  more  general  MCMC-BANN  was  coupled  with  both  the  non-linear  ARD  and 
linear-based  feature  selection  methods,  while  the  linear  LDA  was  only  with  the  linear  stepwise 
feature  selection.  As  some  of  the  calculations  are  computationally  intensive,  particularly  the  t-SNE 
mappings  and  MCMC-BANN  training  for  the  larger  US  data  set,  a  256-CPU  shared  computing 
resource  cluster  was  employed  to  accomplish  runs  in  a  feasible  time  frame. 

IV.  Results 

IV.A.  Classification  Performance. 

MCMC-BANN  and  LDA  classification  performance  is  plotted  as  a  function  of  the  mapped 
or  feature  selected  input  space  dimension  for  the  three  datasets,  US,  DCE-MRI,  and  FFDM,  using 
the  three  different  DR  techniques,  as  well  as  the  non-reduced  selected  features  in  Figure  1  (a-f). 
Performance  is  characterized  in  terms  of  the  0.632+  bootstrapped  AUC  (left  axis)  and  variability  as 
gauged  by  the  width  of  the  empirical  95%  bootstrap  interval  (right  axis).  The  t-SNE  perplexity  was 
set  to  Perp  =  30  and  Laplacian  Eigenmaps  were  generated  with  Nearest  Neighbor= 45  and  t=  1 .0. 
Overall,  the  highest  classification  performance  was  attained  by  the  largest  sample-size  US  feature 
dataset  with  the  DR-MCMC-BANN  just  slightly  eclipsing  the  LDA,  achieving  approximately 
AUC0 .632+  ~  0.90, while  the  smaller  DCE-MRI  and  FFDM  feature  data  produced  peaks  around 
AUC0 .632+  ~  0.80.  The  variability  in  bootstrapped  AUCs  is  also  lowest  for  the  large  US  dataset, 
hovering  near  ~  0.07  as  the  number  of  inputs  into  the  classifier  is  increased. 

A  few  key  observations  can  be  made  from  the  results  regarding  the  use  of  DR.  Primarily,  the 
DR  techniques,  for  both  linear  (PCA)  and  non-linear  (t-SNE  and  Laplacian  Eigenmaps),  overall, 
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appear  to  at  least  match,  or  and  in  some  cases  exceed,  explicit  feature  selection  classification 
AUC0 .632+  performance.  This  is  most  evident  when  compared  to  the  the  ARD-FS  coupled  with  the 
MCMC-BANN  performance  across  all  three  imaging  modalities,  Figures  1(a),  1(c),  1(e)  (left  axis). 
Specifically,  in  all  cases  the  DR  methods  exhibited  a  more  rapid  rise  to  peak  AUC0  632+  performance 
and  remained  higher  than  the  ARD-based  feature  selection  for  all  dimension  input  sizes. 
Additionally,  compared  to  the  ARD  feature  selection  approach,  the  DR  methods  produced  less 
variability  in  the  bootstrap  AUC.  Figures  1(a),  1(c),  1(e)  (right  axis)  substantially  highlight  this 
phenomenon.  In  particular,  for  the  US  data,  the  ARD-FS  variability,  being  greater  that  of  than  all 
the  DR  methods,  clearly  trends  downward  as  more  features  are  selected  for  input;  gradually 
approaching  the  DR  variability  levels,  yet  usually  remaining  higher.  By  comparison,  save  for  a 
slight  increase  at  ID,  the  DR  variability  is  relatively  consistent  from  2D  to  13D. 

However,  when  coupled  with  the  LSW  feature  selection,  the  MCMC-BANN  produced  more 
competitive  results  against  the  DR  performance.  For  example,  for  this  MRI  data  set,  except  for  10D 
and  1  ID,  the  LSW-MCMC-BANN  edged  above  all  the  DR  based  methods.  Likewise,  the  use  of 
the  LSW  feature  selection  with  the  MCMC-BANN  resulted  in  substantially  reduced  variation  in 
classifier  perfomiance  compared  to  the  ARD-FS.  The  LSW-MCMC-BANN  variation  nearly 
matched  the  DR  output  for  both  the  US  and  MRI  across  all  input  dimensions.  For  the  FFDM  data, 
except  for  2D-5D,  the  LSW-MCMC-BANN  held  close  to  the  DR  variation  level. 

The  less  complex,  yet  more  stable  LDA  classifier,  Figures  1  (b),  1  (d),  1  (f)(left  axis),  produced 
different  characteristic  results.  In  all  cases  the  LSW-feature  selection  performance  was  initially 
higher,  however,  as  the  dimension  input  space  was  increased,  the  DR  methods  became  comparable. 
Expectedly,  when  coupled  with  the  linear  LDA,  the  highly-non-linear  stochastic  based  t-SNE  DR 
consistently  underperformed.  Turning  to  variation  for  the  LDA,  Figure  1(b),  1(d),  1(f)  (right  axis), 
the  LSW-FS  again  exhibited  different  behavior  from  ARD-FS,  in  that,  except  for  the  smaller-case- 
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sized  FFDM  data,  variability  does  not  considerably  fluctuate  moving  from  ID  to  13D  for  both  the 
LSW-FS  and  DR  methods. 

One  manner  by  which  to  concisely  analyze  the  performance  characteristics  of  dimension- 
reduction/feature  selection  and  classifiers  designs  for  a  particular  dataset  is  to  plot  the  bootstrap 
cross-validation  AUC  against  the  variability.  An  example  is  provided  for  the  US  feature  dataset  in 
Figure  2,  with  each  point  representing  a  different  number  of  input  dimensions.  Data  points  located 
in  the  upper  left  comer  indicate  the  most  preferred  performance  qualities,  i.e.,  higher  classification 
performance  and  lower  expected  variability.  Also  provided  in  Figure  5,  is  a  plot  displaying 
classification  results  for  both  MCMC-BANN  and  LDA,  in  terms  of  the  bootstrap  AUC  for  the  US 
data.  Included  within  this  plot  are  the  empirical  95%  confidence  intervals  to  aid  in  gauging 
statistical  significance  for  differences  between  estimated  AUC  values. 
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Figure  1 .  The  0.632+  bootstrap  area  under  the  ROC  curve  (AUC)  (left  axis)  and  the  variation  as  measured  by  the  width  of  the  95% 
empirical  bootstrap  confidence  intervals  (right  axis)  versus  the  selected  feature  (ARD.LSW)  or  reduced  representation  (PCA,t- 
SNE, Laplacian  Eigenmap)  classifier  input  space  dimension,  (a)  MCMC-BANN,  (b)  LDA,  classifier  performance  on  the  originally  81 
dimensional  US  feature  dataset,  (c)  MCMC-BANN.  (d)  LDA  classifier  performance  on  the  originally  31  dimensional  DCE-MRI 
feature  dataset,  (e)  MCMC-BANN,  (f)  LDA  classifier  performance  on  the  originally  40  dimensional  FFDM  feature  dataset. 
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Figure  2.  Summary  of  the  classification  performance  on  the  81  dimensional  US 
feature  dataset.  The  0.632+  bootstrapped  area  under  the  ROC  curve  versus 
variability  as  gauged  by  the  width  of  the  95%  empirical  bootstrap  confidence 
intervals.  Each  point  corresponds  to  a  different  input  space  dimension  size.  Points 
located  in  the  upper  left  corner  represent  the  highest  expected  AUC  as  well  as  least 
expected  variation  in  performance  due  to  sampling. 
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•  Malignant  Cystic  ©  Benign  Mass 

Figure  3.  2D  and  3D  visualizations  of  the  unsupervised  reduced  dimension  representations  of  the  entire 
originally  81  dimensional  breast  lesion  ultrasound  feature  dataset;  green  data  points  signifying  benign 
lesions,  red:  malignant,  and  yellow:  benign-cystic,  (a)  Visualization  of  linear  reduction  using  PCA,  first  two 
principal  components,  (b)  first  three  principal  components,  3D  PCA.  (c)  2D  and  (d)  3D  visualization  of  the 
non-linear  reduction  mapping  using  t-SNE.  (e)  2D  and  (f)  3D  visualization  of  the  non-linear  mapping  using 
Laplacian  Eigenmaps. 
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IV.B.  2D  and  3D  Visual  Representations  of  Mappings 

Due  to  the  large  sample  size  of  the  US  feature  data,  a  high  density  of  points  is  produced  (and 
hence  the  clearest  delineation  of  structures)  in  the  reduced  dimension  mapping  representations. 
Figure  3(a-f)  provides  visual  representations  of  the  entire  originally  81  dimensional  US  feature  data 
mapped  into  2D  and  3D  Euclidean  space  by  the  unsupervised  PCA,  t-SNE,  and  Laplacian 
Eigenmaps.  The  data  points  were  subsequently  colored  to  reflect  the  distribution  of  the  lesions  types 
(malignant  tumor,  benign  lesion,  cyst)  with  the  reduced  space. 

Two  key  aspects  are  considered  regarding  the  respective  mappings:  natural  class  separability 
and  overall  geometric  traits  characteristic  of  the  represented  structures,  such  as  smoothness  and 
sparsity.  PCA  is  shown  in  Figures  3(a)  and  3(b).  Certain  regions  are  potentially  identifiable  as 
being  associated  with  a  specific  class  (such  as  the  dominance  of  cystic-benign  points  in  the  bottom 
right  corner  of  the  2D  plot),  however,  PCA  generates  a  relatively  homogeneous,  nearly  spherical 
distribution  of  points.  Reflective  of  its  mathematical  basis,  PCA  representations  provide  primarily 
global  information  content,  lacking  the  capability  to  represent  rich  local  data  structure.  t-SNE 
generates  a  dramatically  different  type  of  low  dimensional  representation.  As  shown  in  figures  3(c) 
and  3(d),  t-SNE  produces  a  highly  non-linear,  jagged,  and  highly  sparse  data  mapping.  Many 
isolated  “island-like”  sub-groupings  are  identifiable  in  the  t-SNE  visual  representations.  As 
predicted  by  the  high  classification  performance  even  for  2D  and  3D,  t-SNE  manages  to  clearly 
capture  inherent  class  structure  associations.  Lastly,  the  Laplacian  Eigenmap,  Figure  3(e)  and  3(f), 
creates  globally  sparse,  yet  locally  smooth  representations.  As  captured  by  the  figures,  the  distinctly 
triangular  form  in  2D  is  revealed  as  a  projected  aspect  of  a  more  complex,  yet  smoothly  connected 
3D  geometric  structure.  As  evident  by  upper  “ridge”  of  malignant  (red)  lesion  points  and  broad 
cystic  (yellow)  “fin”  on  the  left,  the  Laplacian  Eigenmap  also  manages  to  capture  inherent  class 
associations. 
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Figure  4.  3D  visualization  of  the  unsupervised  local  structure  preserving  non-linear  dimension 
reduction  representation  using  Laplacian  Eigenmaps  on  breast  lesion  feature  data,  (a)  3D 
visualization  of  the  entire  originally  31  dimensional  DCE-MRI  feature  data,  green  data  points 
signify  benign  lesions,  red:  malignant-lDC,  and  blue:  malignant-DCIS  .  (b)  3D  visualization  of  the 
entire  originally  40  dimensional  FFDM  feature  data,  green  points  for  benign  and  red  for  malignant 
lesions. 

The  FFDM  and  DCE-MRI  visual  representations  are  noisier  than  the  US  due  to  the  smaller 
sample  size.  A  few  examples  are  provided  in  Figure  4(a,b).  The  MRI  dataset  clearly  exhibits  a 
sparse  arc-like  geometric  structure  using  the  Laplacian  Eigenmap.  This  structure  seemingly 
separates  the  bulk  of  benign  (green)  lesions  from  the  IDC  (red)  while  dispersing  the  DCIS  (blue) 
cases  in  between. 


V.  Discussion 

V.A  Dimension  Reduction  in  CADx 

Three  major  conclusions  can  be  made  regarding  the  use  of  DR  techniques  in  breast  CADx 
from  this  study.  First,  and  most  importantly,  information  critical  for  the  classification  of  breast 
mass  lesions  contained  within  the  original  high-dimensional  CADx  feature  vectors  is  not  destroyed 
by  applying  the  unsupervised,  non-linear  DR  and  representation  techniques  of  t-SNE  and  Laplacian 
Eigenmaps.  This  observation  is  strongly  supported  by  the  robustness  of  the  classification 
performance  across  the  three  different  imaging  modalities,  US,  DCE-MRI,  and  FFDM. 

Second,  according  to  the  statistical  re-sampling  validation  methods,  the  DR-based 
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classification  performance  characteristics  appear  to  potentially  rival  or  in  some  cases  exceed  that  of 
traditional  feature-selection  based  techniques.  Additionally,  both  the  linear  PCA  and  non-linear  t- 
SNE  and  Laplacian  Eigenmap  methods  often  generated  “tighter”  95%  empirical  bootstrap  intervals, 
implying  reduced  variance  in  classifier  output,  as  compared  to  the  feature  selection  based 
approaches,  especially  ARD,  see  Figure  (4).  For  instance,  in  the  large  US  dataset,  the  performance 
for  13  ARD  selected  features  was  AUC0  632+  =  0.88  with  95%  empirical  bootstrap  interval 
[0.787;0.895]  and  for  4  LSW  selected  features  was  AUC0.632+  =  0.87  with  interval  [0.8 1 7 ;0.906] 
compared  to  4D  t-SNE  mapping  (from  the  original  8 ID  feature  space)  giving  AUC0.632+  =  0.90  with 
interval  [0.847 ;0.9 19].  These  findings  imply  that  the  generally  non-linear  manifold,  on  which  US 
feature  data  exists,  embedded  in  four  dimensional  Euclidean  space  can  adequately  represent  the 
critical  information  for  classification.  These  results  build  evidence  for  some  potential  benefits  of 
employing  the  information-preserving,  DR  techniques  in  place  of  explicit  feature  selection,  including 
the  avoidance  of  the  “curse  of  dimensionality”. 

Third,  the  non-linear  DR  techniques  generated  visually -rich  embedded  mappings  with  a 
geometric  structure  that  often  presented  sparse  separation  between  class  categories,  as  demonstrated 
in  Figure  3(b):  malignant,  benign,  cyst,  and  Figure  4(a):  benign,  DCIS,  IDC.  The  natural  class 
associations  visible  in  the  mappings  are  not  totally  unexpected  since,  as  explored  above,  the 
classification  performance  results  clearly  demonstrate  the  reduced  mapping’s  capacity  to  retain 
sufficient  information  for  class-discrimination.  The  large  sample  number  of  the  US  dataset  provided 
the  most  vivid  visualizations,  highlighting  both  the  geometric  forms  and  sparse  quality  of  the  non¬ 
linear  embeddings.  Although  PCA  retained  high  supervised  classification  performance,  unlike  the 
non-linear  Laplacian  Eigenmaps  and  t-SNE  embeddings,  Figures  3(d), 3(f),  PCA  is  not  capable  of 
adequately  representing  the  data’s  inherent  local  structural  properties,  Figure  3(b),  leading  to  less 
informative  visualizations.  Yet,  the  two  non-linear  methods  offer  distinct  perspectives  on  the  data 
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structures.  The  Laplacian  Eigenmap  appears  to  perhaps  frame  the  lesions  in  a  more  globally  smooth 
context  as  evidenced  by  the  gradual  transitions  between  distant  regions  of  the  geometric  form, 
whereas  t-SNE  creates  many  distinct  jagged  “islands”  of  clustered  lesion  points.  These  emergent 
characteristics  reflect  the  theoretically  motivated  principles  driving  the  respective  non-linear  DR 
algorithms. 

V.B  Reduction  Method  Parameters 

We  briefly  explored  the  impact  of  the  parameter  selection  towards  performance  and  visual 
appearance.  To  our  knowledge  there  is  no  principled  way  to  optimally  select  a  parameter 
configuration,  thus  we  simply  choose  parameters  that  gave  reasonable  mappings  as  discemable  in 
the  2D/3D  representations.  This  is  a  problem  in  general  for  many  unsupervised  techniques.  In  fact, 
as  t-SNE  creators  noted2,  the  method  was  primarily  considered  for  visualization  purposes  and  not 
explicitly  for  DR  beyond  3  dimensions.  Performance  of  t-SNE  is  not  well  understood  for  the  general 
purpose  of  DR  and  subsequent  classification.  Future  work  may  be  of  interest  to  discover  procedures 
for  identifying  “optimal”  or  “near-optimal”  subsets  of  parameters  for  CADx  or  similar  machine 
learning  purposes. 

V.C.  Classifiers  and  Feature  Selection 

In  considering  classifier  design,  one  desires  to  be  “as  simple  as  possible,  but  no  simpler,” 
meaning  the  most  robust  scheme  in  terms  of  both  performance  and  stability  (low  variability  in 
performance  between  different  samples  from  the  same  underlying  distribution),  all  while  attempting 
to  constrain  the  number  of  parameters,  namely  the  input  space  dimension.  Additionally,  simpler 
models  facilitate  future  repeatability  with  new  contexts  and  datasets.  The  degree  to  which  such 
pursuits  are  successful  is  dependent  upon  the  interplay  of  the  three  main  aspects  affecting  the 
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performances  of  the  classifiers  including:  sample  size,  data  complexity,  and  model 
complexity/regularization.  Naturally  included  within  the  scope  of  the  model 
complexity/regularization  is  the  choice  of  inputs  to  the  classifier,  whether  in  the  form  of  DR 
mappings  or  a  set  of  selected  features,  as  this  also  critically  influences  ultimate  classification 
capability.  Ideally,  any  classifier’s  aim  is  to  synthesize  the  information  available  from  the  input 
space  in  a  complete  and  unbiased  fashion  towards  accomplishing  the  decision  task.  In  general, 
classification  of  new  input  based  on  finite  training  dataset  is  an  “ill-posed”  problem,  and  regardless 
of  the  sophistication  of  regularization  employed,  instability  may  persist.  15  For  these  reasons  both 
the  LDA  and  MCMC-BANN  were  investigated.  By  spanning  over  three  different  imaging 
modalities  of  varying  data  set  size,  using  two  different  classifiers,  and  employing  three  different 
feature  space  approaches,  all  three  of  these  key  concepts  (sample  size,  sample  complexity,  and 
model  complexity)  were  touched  upon  in  the  course  of  this  investigation. 

For  the  relatively  large  US  dataset,  with  1 126  unique  lesions  making  up  2956  lesion  images, 
some  of  the  relative  strengths  associated  with  the  more  general,  non-linear  MCMC-BANN  were 
particularly  apparent.  Specifically,  the  MCMC-BANN,  when  paired  with  either  the  DR  techniques 
or  LSW-FS  was  able  to  achieve  high  AUCo.632+  performance,  even  at  low  input  space  dimensions,  as 
seen  in  Figure  1(a).  This  is  in  part  due  to  the  MCMC-BANN  ability  to  generalize  to  any  target 
distribution,  yet  remain  relatively  well  regularized,  thereby  avoiding  “over-fitting”  and  severe 
underperformance  on  testing  data.  Yet,  critically,  when  relying  on  explicit  feature  selection,  across 
all  input  space  dimension  sizes  for  the  FFDM  and  MRI  data,  and  when  fewer  than  9  features  were 
selected  for  the  US  data,  the  MCMC-BANN’s  success  was  contingent  upon  the  use  of  LSW-FS  over 
ARD-FS.  The  MCMC-BANN  severely  underperformed  when  coupled  with  the  ARD-FS,  especially 
when  limited  to  picking  only  a  few  features.  The  smaller  AUC0.632+.  and  higher  bootstrap  variability 
(most  dramatically  evident  for  the  lower  input  space  dimensions),  reveals  limitations  in  ARD-FS 
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ability  to  consistently  identify  smaller  sub-sets  of  features  capable  of  robustly  contributing  to  the 
classification  task.  This  limitation  may  be  in  part  due  to  ARD’s  capacity  for  discovering  non-linear 
associations,  which  may  vary  highly  between  different  bootstrapped  sub-samples,  as  well  as  its  less 
direct  approach  (compared  to  LSW)  in  feature  determination. 

Turning  to  LDA,  while  not  best  suited  to  model  the  non-linear  DR  mappings,  the  robustness 
and  stability  of  LDA  shines  when  joined  with  LSW-FS  for  classification  purposes.  LDA  is,  in  a 
sense,  naturally  regularized  by  its  linear  nature  and  thus  automatically  avoids  severe  over-fitting 
situations.  Often,  the  relative  advantage  of  a  more  complex  classifier,  such  as  MCMC-BANN,  over 
LDA,  may  begin  to  erode  as  sample  size  decreases,  even  if  the  underlying  distribution  is  not 
completely  linear  in  nature.  These  phenomena  are  apparent  for  the  much  smaller  FFDM  (245 
unique  cases,  on  735  images)  and  DCE-MRI  (356  unique  lesions/images)  datasets,  as  the  less 
sophisticated  LDA  often  produced  the  highest  AUCo.632+  values.  The  LDA  classifier  showed  the 
greatest  strength  with  the  MRI  data,  nearly  matching  the  LSW-MCMC-BANN  and  similarly  for  the 
DR  approaches. 

Furthermore,  in  examining  Figure  2  again,  among  points  falling  within  desirable 
performance  specifications  (upper  left-hand  corner:  high  classification  performance/lower  expected 
variability),  it  is  reasonable  to  favor  configurations  which  require  the  lowest  input  space 
dimensionality,  as  discussed  previously  (either  the  number  selected  features  or  target  embedded 
mapping  dimensions).  A  potential  advantage  of  DR  is  that  it  may  reduce  the  amount  of  necessary 
parameters  (not  including  the  unsupervised  transformation  characterized  by  the  data  itself)  required 
to  form  a  satisfactory  data  representation  suitable  for  robust  classification.  In  fact,  most  motivation 
for  performing  DR  is  lost  if  the  target  dimension  is  not  considerably  lower  than  the  original  high 
dimensional  space.  This  is  because  such  mapped  representations  become  less  efficient  compared  to 
simply  making  use  of  the  original  feature  space  or  selected  sub-space  as  dimensions  are  added. 
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Thus,  within  the  framework  of  these  criteria,  in  reviewing  the  results  from  the  three  modalities  on 
whole,  one  may  postulate,  that  as  an  overall  strategy,  4D  t-SNE  appears  likely  to  produce 
competitive  classification  performance  when  used  as  input  into  a  non-linear  classifier  such  as  the 
MCMC-BANN.  Such  classification  performance  coupled  with  the  intriguing  2D  and  3D 
visualizations  of  the  overall  data  structure  may  evoke  attractive  research  potential. 

In  practice,  it  should  be  noted  that,  with  the  sole  intention  of  maximizing  classification 
performance  based  on  finite  sample  training  data,  there  may  be  no  clear  advantage  for  use  of  DR 
techniques  over  traditional  feature  selection.  Although,  again,  due  to  the  “curse  of  dimensionality,” 
as  the  input  space  dimension  for  classification  becomes  higher  in  dimension,  eventually  cross- 
validation  based-performance  will  stagnant  or  even  begin  to  regress  lower.  This  occurs  as  the  dataset 
sample  size  is  not  sufficient  to  adequately  isolate  a  unique  classifier  solution  (as  many,  potentially 
infinite,  become  possible)  and  marginal,  if  not  none  at  all,  new  information  is  gained  by  the 
additional  dimensions.  Thus,  for  these  reasons  and  in  order  to  compare  each  dataset  on  common 
ground,  the  tests  were  limited  to  1D-13D. 
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US  Classifier  AUC0  632+ Performance  with  95%  Confidence  Intervals 
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ALaplacian 

XLSW-FS 


LDA  AUC0  632+ 


Figure  5.  The  0.632+  bootstrapped  area  under  the  ROC  curve  is  shown  for  MCMC- 
BANN  (vertical  axis)  versus  LDA  (horizontal  axis)  with  95%  empirical  bootstrap 
confidence  intervals  included,  for  the  originally  81  dimensional  US  feature  dataset 
dimension  reduced  input  or  with  LSW  selected  features. 


VI.  Conclusion 

The  ability  to  capture  high-dimensional  data  structure  in  a  human  interpretable  low¬ 
dimensional  representation  is  a  powerful  research  tool.  The  above  findings  strongly  suggest  the 
relevance  of  non-linear  DR  and  representation  techniques  to  future  CADx  research.  DR  cannot  be 
expected  to  replace  the  benefits  of  feature  selection  based  approaches  in  many  cases.  Yet,  these 
techniques,  in  addition  to  competitive  classification  performance,  do  offer  complementary 
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information  and  a  fresh  perspective  on  interpreting  the  overall  structure  of  the  feature  data.  Of 
interest  to  future  studies  is  to  further  investigate  the  origin,  meaning,  and  physical  interpretation  of 
the  discovered  structures  present  in  the  CADx  lesion  data  as  revealed  by  these  non-linear,  local- 
geometry  preserving  representations.  Such  rich  data  structure  representations  may  offer  novel 
insights  and  useful  understandings  of  clinical  CADx  image  data. 
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VIII.  Appendix 

VIII.  A.  Laplacian  Eigenmaps  Algorithm  Outline 

Beginning  with  k  input  points,  xh  xk,  in  R1: 

Step  1 :  Construct  the  Adjacency  Graph:  Generate  a  graph  with  edges  connecting  nodes  i 

and  j 
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if  Xj  and  x;  are  “close.”  Closeness  is  defined  by  the  nodes  included  in  the  N  nearest 
neighbors.  This  relation  is  naturally  symmetric  between  points  i  and  j.  The 
parameter  N 

must  be  selected. 

Step  2:  Choosing  Weight:  The  “heat  kernel”  is  used  to  assign  weights  to  edge  connected 

nodes i 

and  j  :  Wr=  exp(-\  \xrXj\  \2/t).  Otherwise  use  Wk]  =  0  for  unconnected  vertices.  See 
Belkin  and  Niyogi  for  kernel  justification1 .  The  parameter  t  is  user  defined.  If  t  is 
set  very  high,  or  approximately,  t  =  oo ,  the  edge  connected  node  weights  are 
essentially  W-n=\,  this  option  can  be  used  to  avoid  parameter  selection. 

Step  3:  Computing  Eigenmaps:  Assuming  a  connected  graph  generated  in  step  1,  G,  solve 
for 

the  following  eigenvector  and  eigenvalues:  Lf=  XDf  ,  where  D  is  the  diagonal 

weight 

matrix,  defined  by  summing  over  the  rows  of  W .  Dn  =2j  WV}  and  L  is  the  Laplacian 
matrix  defined  as:  L=D-W.  Symmetric  and  positive  semi-definite,  conceptually  the 
Laplacian  matrix  acts  as  an  operator  on  functions  defined  by  graph  G”s  vertices. 
Solving  the  equation,  let  f0,. . .,  fk_,be  the  eigenvectors,  arranged  in  accordance  to  their 
eigenvalues:0  =  a0<  a,<  . . .  <Xk.  Lf0  =  X0Df()  Lfk_!  =  Xk_,DfkA. 

Finally,  the  k  input  data  points  in  R1  are  embedded  in  m -dimensional  Euclidean  space  using  the  m 
eigenvectors  after  the  zero  eigen-valued  f0,  xt  — >■  (f,  (/),...,  f,„ (/)) 

VIII,  B.  t-SNE  Algorithm  Outline 
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Beginning  with  k  input  points,  {xh  xk}  in  R;,  set  Perplexity  parameter,  Perp,  number  of 
iterations  T,  learning  rate  r\,  and  momentum  a(t). 

Step  1.  Compute  Similarities :  Compute  pairwise  pt ,  probabilities  using  the  cr,  found  with 
perplexity  Perp ,  and  use  symmetrized  conditional  probability  distributions 

Pij  =  (Pj\i  +  Pi^2k 

Step  2.  Initialize  Solution  Sample:  Sample  from  MO,  1  (f4/"')  for  initial  points  {yh  yk} 
Step  3.  Execute  T  Update  Iterations  on  Y:  Compute  low-dimension  similarities  q(/  using  eq. 
(4) 

and  gradient  using  eq(5).  Update  Y  using  y(0  =  y(w)  +  ^  —  +  a(0(Y(^1)-Y('_2)) 

Output:  Low-dimension  mapping  \yh  yk\  inR"' 
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